p <- plot(w)
p$Categoricalp$Continuous This article outlines analysis project workflow that I’ve found to be efficient in making reproducible research reports using R with Rmarkdown and now Quarto. I start by covering the creation of annotated analysis files and running descriptive statistics on them with goals of understanding the data and the quality and completeness of the data. Functions in the Hmisc package are used to annotate data frames and data tables with labels and units of measurement and to produce tabular and graphical statistical summaries. Several examples of processing and manipulating data using the data.table package are given. Finally, examples of caching results and doing parallel processing are presented.
I have a directory for each major project, and put everything in that directory (including data files) except for graphics files for figures, which are placed in their own subdirectory underneath the project folder. The directory name for the project includes key identifying information, and files within that directory do not contain project information in their names, nor do they contain dates, unless I want to freeze an old version of an analysis script.
Quarto I specify that html files are to be self-contained, so there are no separate graphics files.For multi-analyst projects or ones in which you want to capture the entire code history, having the project on github is worthwhile.
I typically create a compact analysis file in a separate R script called create.r and have it produce compressed R binary data.frame or data.table .rds files using saveRDS(name, 'name.rds', compress='xz'). Then I have an analysis script named for example a.qmd (for Quarto reports) or a.Rmd (for RMarkdown reports) that starts with d <- readRDS('name.rds').
When variables need to be recoded, have labels added or changed, or have units of measurement added, I specify those using the Hmisc package upData function. Here is an example:
describe and contents function outputs below and in axis labels for graphics.# Function to recode from atypical coding for yes/no in raw data
yn <- function(x) factor(x, 0:1, c('yes', 'no'))
d <-
upData(d,
rename = c(gender='sex', any.event='anyEvent'),
posSE = yn(posSE),
newMI = yn(newMI),
newPTCA = yn(newPTCA),
newCABG = yn(newCABG),
death = yn(death),
hxofHT = yn(hxofHT),
hxofDM = yn(hxofDM),
hxofCig = factor(hxofCig, c(0, 0.5, 1),
c('heavy', 'moderate', 'non-smoker')),
hxofMI = yn(hxofMI),
hxofPTCA = yn(hxofPTCA),
hxofCABG = yn(hxofCABG),
chestpain= yn(chestpain),
anyEvent = yn(anyEvent),
drop=Cs(event.no,phat,mics,deltaEF, # Cs in Hmisc; auto quotes
newpkmphr,gdpkmphr,gdmaxmphr,gddpeakdp,gdmaxdp,
hardness),
labels=c(
bhr ='Basal heart rate',
basebp ='Basal blood pressure',
basedp ='Basal Double Product bhr*basebp',
age ='Age',
pkhr ='Peak heart rate',
sbp ='Systolic blood pressure',
dp ='Double product pkhr*sbp',
dose ='Dose of dobutamine given',
maxhr ='Maximum heart rate',
pctMphr='Percent maximum predicted heart rate achieved',
mbp ='Maximum blood pressure',
dpmaxdo='Double product on max dobutamine dose',
dobdose='Dobutamine dose at max double product',
baseEF ='Baseline cardiac ejection fraction',
dobEF ='Ejection fraction on dobutamine',
chestpain='Chest pain',
ecg ='Baseline electrocardiogram diagnosis',
restwma ='Resting wall motion abnormality on echocardiogram',
posSE ='Positive stress echocardiogram',
newMI ='New myocardial infarction',
newPTCA ='Recent angioplasty',
newCABG ='Recent bypass surgery',
hxofHT ='History of hypertension',
hxofDM ='History of diabetes',
hxofMI ='History of myocardial infarction',
hxofCig ='History of smoking',
hxofPTCA='History of angioplasty',
hxofCABG='History of coronary artery bypass surgery',
anyEvent='Death, newMI, newPTCA, or newCABG'),
units=c(age='years', bhr='bpm', basebp='mmHg', basedp='bpm*mmHg',
pkhr='mmHg', sbp='mmHg', dp='bpm*mmHg', maxhr='bpm',
mbp='mmHg', dpmaxdo='bpm*mmHg', baseEF='%', dobEF='%',
pctMphr='%', dose='mg', dobdose='mg')
)
saveRDS(d, 'stressEcho.rds', compress='xz')
# Note that we could have automatically recoded all 0:1 variables
# if they were all to be treated identically:
for(x in names(d))
if(all(d[[x]] %in% c(0,1))) d[[x]] <- yn(d[[x]])
If reading data exported from REDCap that are placed into the project directory I run the following to get rid of duplicate (factor and non-factor versions of variables REDCap produces) variables and automatically convert dates to Date variables:
require(Hmisc)
getRs('importREDCap.r', put='source') # source() code to define function
mydata <- importREDCap() # by default operates on last downloaded export
saveRDS(mydata, 'mydata.rds', compress='xz')
When file names are not given to importREDCap the function looks for the latest created .csv file and .R file with same prefix and uses those. See this for more information.
The Hmisc package contents function will provide a concise data dictionary. Here is an example using the permanent version of the dataset created above, which can be accessed with the Hmisc getHdata function. The top of the contents output has the number of levels for factor variables hyperlinked. Click on the number to go directly to the list of levels for that variable.
stressEcho coded binary variables as 0/1 instead of N/Y.require(Hmisc)
getHdata(stressEcho)
d <- stressEcho
html(contents(d), levelType='table')| Name | Labels | Units | Levels | Storage |
|---|---|---|---|---|
| bhr | Basal heart rate | bpm | integer | |
| basebp | Basal blood pressure | mmHg | integer | |
| basedp | Basal Double Product bhr*basebp | bpm*mmHg | integer | |
| pkhr | Peak heart rate | mmHg | integer | |
| sbp | Systolic blood pressure | mmHg | integer | |
| dp | Double product pkhr*sbp | bpm*mmHg | integer | |
| dose | Dose of dobutamine given | mg | integer | |
| maxhr | Maximum heart rate | bpm | integer | |
| pctMphr | Percent maximum predicted heart rate achieved | % | integer | |
| mbp | Maximum blood pressure | mmHg | integer | |
| dpmaxdo | Double product on max dobutamine dose | bpm*mmHg | integer | |
| dobdose | Dobutamine dose at max double product | mg | integer | |
| age | Age | years | integer | |
| gender | 2 | integer | ||
| baseEF | Baseline cardiac ejection fraction | % | integer | |
| dobEF | Ejection fraction on dobutamine | % | integer | |
| chestpain | Chest pain | integer | ||
| restwma | Resting wall motion abnormality on echocardiogram | integer | ||
| posSE | Positive stress echocardiogram | integer | ||
| newMI | New myocardial infarction | integer | ||
| newPTCA | Recent angioplasty | integer | ||
| newCABG | Recent bypass surgery | integer | ||
| death | integer | |||
| hxofHT | History of hypertension | integer | ||
| hxofDM | History of diabetes | integer | ||
| hxofCig | History of smoking | 3 | integer | |
| hxofMI | History of myocardial infarction | integer | ||
| hxofPTCA | History of angioplasty | integer | ||
| hxofCABG | History of coronary artery bypass surgery | integer | ||
| any.event | Death, newMI, newPTCA, or newCABG | integer | ||
| ecg | Baseline electrocardiogram diagnosis | 3 | integer |
| Variable | Levels |
|---|---|
| gender | male |
| female | |
| hxofCig | heavy |
| moderate | |
| non-smoker | |
| ecg | normal |
| equivocal | |
| MI |
The Hmisc describe function is my main tool for getting initial descriptive statistics and quality controlling the data in a univariate fashion. Here is an example. The Info index is a measure of the information content in a numeric variable relative to the information in a continuous numeric variable with no ties. A very low value of Info will occur when a highly imbalanced variable is binary.
Info comes from the approximate formula for the variance of a log odds ratio for a proportional odds model/Wilcoxon test, due to Whitehead.Gmd stands for Gini’s mean difference—the mean absolute difference over all possible pairs of different observations. It is a very interpretable measure of dispersion that is more robust than the standard deviation.w <- describe(d)
html(w, scroll=TRUE, rows=50)| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 69 | 0.999 | 75.29 | 16.57 | 54.0 | 58.0 | 64.0 | 74.0 | 84.0 | 95.3 | 102.0 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 94 | 0.998 | 135.3 | 23.35 | 104.0 | 110.0 | 120.0 | 133.0 | 150.0 | 162.3 | 170.1 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 441 | 1 | 10181 | 2813 | 6607 | 7200 | 8400 | 9792 | 11663 | 13610 | 14770 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 105 | 1 | 120.6 | 25.36 | 81.85 | 90.70 | 106.25 | 122.00 | 135.00 | 147.00 | 155.15 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 142 | 1 | 146.9 | 40.72 | 96 | 102 | 120 | 141 | 170 | 200 | 210 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 508 | 1 | 17634 | 5765 | 10256 | 11341 | 14033 | 17060 | 20644 | 24536 | 26637 |
| n | missing | distinct | Info | Mean | Gmd |
|---|---|---|---|---|---|
| 558 | 0 | 7 | 0.84 | 33.75 | 8.334 |
Value 10 15 20 25 30 35 40 Frequency 2 28 47 56 64 61 300 Proportion 0.004 0.050 0.084 0.100 0.115 0.109 0.538
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 103 | 1 | 119.4 | 24.64 | 82.0 | 91.0 | 104.2 | 120.0 | 133.0 | 146.0 | 154.1 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 78 | 0.999 | 78.57 | 16.86 | 53 | 60 | 69 | 78 | 88 | 97 | 104 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 132 | 0.999 | 156 | 35.03 | 110.0 | 120.0 | 133.2 | 150.0 | 175.8 | 200.0 | 211.1 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 484 | 1 | 18550 | 5385 | 11346 | 12865 | 15260 | 18118 | 21239 | 24893 | 27477 |
| n | missing | distinct | Info | Mean | Gmd |
|---|---|---|---|---|---|
| 558 | 0 | 8 | 0.941 | 30.24 | 10.55 |
Value 5 10 15 20 25 30 35 40 Frequency 7 7 55 73 71 78 62 205 Proportion 0.013 0.013 0.099 0.131 0.127 0.140 0.111 0.367
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 62 | 0.999 | 67.34 | 13.41 | 46.85 | 51.00 | 60.00 | 69.00 | 75.00 | 82.00 | 85.00 |
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 2 |
Value male female Frequency 220 338 Proportion 0.394 0.606
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 54 | 0.994 | 55.6 | 10.71 | 32 | 40 | 52 | 57 | 62 | 65 | 66 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 60 | 0.992 | 65.24 | 12.38 | 40.0 | 49.7 | 62.0 | 67.0 | 73.0 | 76.0 | 80.0 |
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.64 | 172 | 0.3082 | 0.4272 |
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.745 | 257 | 0.4606 | 0.4978 |
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.553 | 136 | 0.2437 | 0.3693 |
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.143 | 28 | 0.05018 | 0.09549 |
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.138 | 27 | 0.04839 | 0.09226 |
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.167 | 33 | 0.05914 | 0.1115 |
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.123 | 24 | 0.04301 | 0.08247 |
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.625 | 393 | 0.7043 | 0.4173 |
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.699 | 206 | 0.3692 | 0.4666 |
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 3 |
Value heavy moderate non-smoker Frequency 122 138 298 Proportion 0.219 0.247 0.534
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.599 | 154 | 0.276 | 0.4004 |
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.204 | 41 | 0.07348 | 0.1364 |
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.399 | 88 | 0.1577 | 0.2661 |
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 558 | 0 | 2 | 0.402 | 89 | 0.1595 | 0.2686 |
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 3 |
Value normal equivocal MI Frequency 311 176 71 Proportion 0.557 0.315 0.127
There is also a plot method for describe output. It produces two graphics objects: one for categorical variables and one for continuous variables. The default is to use ggplot2 to produce static graphics.
p <- plot(w)
p$Categoricalp$ContinuousBy specifying grType option you can instead get plotly graphics that use hover text to show more information, especially when hovering over the leftmost dot or tick mark for a variable.
options(grType='plotly')
p <- plot(w)
p$Categoricalp$ContinuousSee this for other Hmisc functions for descriptive graphics and tables, especially for stratified descriptive statistics for categorical variables. The summaryM function prints a tabular summary of a mix of continuous and categorical variables. Here is an example where stratification is by history of myocardial infarction (MI).
require(data.table)
setDT(d) # turn d into a data table
w <- d
w[, hxofMI := factor(hxofMI, 0 : 1, c('No history of MI', 'History of MI'))]
vars <- setdiff(names(d), 'hxofMI')
form <- as.formula(paste(paste(vars, collapse='+'), '~ hxofMI'))
print(form)bhr + basebp + basedp + pkhr + sbp + dp + dose + maxhr + pctMphr + mbp + dpmaxdo + dobdose + age + gender + baseEF + dobEF + chestpain + restwma + posSE + newMI + newPTCA + newCABG + death + hxofHT + hxofDM + hxofCig + hxofPTCA + hxofCABG + any.event + ecg ~ hxofMI
s <- summaryM(form, data=d, test=TRUE)
html(s, exclude1=TRUE, npct='both', digits=3, middle.bold=TRUE)| Descriptive Statistics (N=558). | |||
| No history of MI N=404 |
History of MI N=154 |
Test Statistic |
|
|---|---|---|---|
Basal heart rate bpm |
65 74 85 | 63 72 84 | F1 556=1.41, P=0.2351 |
Basal blood pressure mmHg |
120 134 150 | 120 130 150 | F1 556=1.39, P=0.2381 |
Basal Double Product bhr*basebp bpm*mmHg |
8514 9874 11766 | 8026 9548 11297 | F1 556=3.32, P=0.0691 |
Peak heart rate mmHg |
107 123 136 | 104 120 132 | F1 556=2.35, P=0.1261 |
Systolic blood pressure mmHg |
124 146 174 | 115 134 158 | F1 556=12.1, P<0.0011 |
Double product pkhr*sbp bpm*mmHg |
14520 17783 21116 | 13198 15539 18885 | F1 556=15, P<0.0011 |
Dose of dobutamine given mg : 10 |
0.00 2⁄404 | 0.00 0⁄154 | χ26=8.77, P=0.1872 |
| 15 | 0.05 21⁄404 | 0.05 7⁄154 | |
| 20 | 0.10 40⁄404 | 0.05 7⁄154 | |
| 25 | 0.11 45⁄404 | 0.07 11⁄154 | |
| 30 | 0.11 43⁄404 | 0.14 21⁄154 | |
| 35 | 0.11 45⁄404 | 0.10 16⁄154 | |
| 40 | 0.51 208⁄404 | 0.60 92⁄154 | |
Maximum heart rate bpm |
107 122 134 | 102 118 130 | F1 556=4.05, P=0.0451 |
Percent maximum predicted heart rate achieved % |
69.0 78.0 89.0 | 70.0 77.0 87.5 | F1 556=0.5, P=0.4791 |
Maximum blood pressure mmHg |
138 154 180 | 130 142 162 | F1 556=13, P<0.0011 |
Double product on max dobutamine dose bpm*mmHg |
15654 18666 21664 | 14489 16785 19680 | F1 556=16.1, P<0.0011 |
Dobutamine dose at max double product mg : 5 |
0.01 4⁄404 | 0.02 3⁄154 | χ27=8.5, P=0.292 |
| 10 | 0.01 6⁄404 | 0.01 1⁄154 | |
| 15 | 0.11 43⁄404 | 0.08 12⁄154 | |
| 20 | 0.14 58⁄404 | 0.10 15⁄154 | |
| 25 | 0.13 54⁄404 | 0.11 17⁄154 | |
| 30 | 0.13 51⁄404 | 0.18 27⁄154 | |
| 35 | 0.10 40⁄404 | 0.14 22⁄154 | |
| 40 | 0.37 148⁄404 | 0.37 57⁄154 | |
Age years |
59.0 68.0 75.0 | 63.2 71.0 76.8 | F1 556=9.75, P=0.0021 |
| gender : female | 0.68 273⁄404 | 0.42 65⁄154 | χ21=30, P<0.0012 |
Baseline cardiac ejection fraction % |
55 59 63 | 40 54 60 | F1 556=56.4, P<0.0011 |
Ejection fraction on dobutamine % |
65.0 70.0 74.2 | 50.0 64.5 70.0 | F1 556=50.3, P<0.0011 |
| Chest pain | 0.29 119⁄404 | 0.34 53⁄154 | χ21=1.29, P=0.2572 |
| Resting wall motion abnormality on echocardiogram | 0.57 230⁄404 | 0.18 27⁄154 | χ21=69.7, P<0.0012 |
| Positive stress echocardiogram | 0.21 86⁄404 | 0.32 50⁄154 | χ21=7.56, P=0.0062 |
| New myocardial infarction | 0.03 14⁄404 | 0.09 14⁄154 | χ21=7.4, P=0.0072 |
| Recent angioplasty | 0.02 10⁄404 | 0.11 17⁄154 | χ21=17.8, P<0.0012 |
| Recent bypass surgery | 0.05 21⁄404 | 0.08 12⁄154 | χ21=1.35, P=0.2462 |
| death | 0.04 15⁄404 | 0.06 9⁄154 | χ21=1.23, P=0.2672 |
| History of hypertension | 0.69 280⁄404 | 0.73 113⁄154 | χ21=0.89, P=0.3462 |
| History of diabetes | 0.36 147⁄404 | 0.38 59⁄154 | χ21=0.18, P=0.6742 |
| History of smoking : heavy | 0.21 83⁄404 | 0.25 39⁄154 | χ22=3.16, P=0.2062 |
| moderate | 0.24 96⁄404 | 0.27 42⁄154 | |
| non-smoker | 0.56 225⁄404 | 0.47 73⁄154 | |
| History of angioplasty | 0.04 15⁄404 | 0.17 26⁄154 | χ21=28.4, P<0.0012 |
| History of coronary artery bypass surgery | 0.08 34⁄404 | 0.35 54⁄154 | χ21=59.6, P<0.0012 |
| Death, newMI, newPTCA, or newCABG | 0.12 48⁄404 | 0.27 41⁄154 | χ21=18.1, P<0.0012 |
| Baseline electrocardiogram diagnosis : normal | 0.59 240⁄404 | 0.46 71⁄154 | χ22=8, P=0.0182 |
| equivocal | 0.29 117⁄404 | 0.38 59⁄154 | |
| MI | 0.12 47⁄404 | 0.16 24⁄154 | |
|
a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. Tests used: 1Wilcoxon test; 2Pearson test . | |||
plot(s, which='categorical', vars=1 : 4) # plot first 4 categorical variablesplot(s, which='continuous', vars=1 : 4) # plot first 4 continuous variablesSemi-interactive stratified spike histograms are also useful descriptive plots. These plots also contain a superset of the quantiles used in box plots, and the legend is clickable, allowing any of the statistical summaries to be turned off.
d[, histboxp(x=maxhr, group=ecg, bins=200)]The data.table package provides a concise, consistent syntax for managing simple and complex data manipulation tasks, and it is extremely efficient for large datasets. One of the best organized tutorials is this, and a master cheatsheet for data.table is here from which the general syntax below is taken, where DT represents a data table.
DT[ i, j, by ] # + extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
Data tables are also data frames so work on any method handling data frames. But data tables do not contain rownames.
Several data.table examples follow. I like to hold the current dataset in d to save typing. To facilitate some operations requiring variable names to be quoted, define a function .q to quote them automatically (like the Hmisc function Cs).
.q <- function(...) as.character(sys.call())[-1]
.q(a,b,c)[1] "a" "b" "c"
d[, html(describe(age))]| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 62 | 0.999 | 67.34 | 13.41 | 46.85 | 51.00 | 60.00 | 69.00 | 75.00 | 82.00 | 85.00 |
d[, html(describe(~ age + gender))]| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 62 | 0.999 | 67.34 | 13.41 | 46.85 | 51.00 | 60.00 | 69.00 | 75.00 | 82.00 | 85.00 |
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 2 |
Value male female Frequency 220 338 Proportion 0.394 0.606
d[gender == 'female', html(describe(age))] # analyze age for females| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 338 | 0 | 58 | 0.999 | 67.01 | 13.74 | 47.00 | 50.70 | 59.25 | 68.00 | 76.00 | 83.00 | 85.00 |
html(describe(d[, .(age, gender)], 'Age and Gender Stats'))| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 558 | 0 | 62 | 0.999 | 67.34 | 13.41 | 46.85 | 51.00 | 60.00 | 69.00 | 75.00 | 82.00 | 85.00 |
| n | missing | distinct |
|---|---|---|
| 558 | 0 | 2 |
Value male female Frequency 220 338 Proportion 0.394 0.606
# Separate analysis by female, male. Use keyby instead of by to sort the usual way.
d[, print(describe(age, descript=gender)), by=gender]male : Age [years]
n missing distinct Info Mean Gmd .05 .10
220 0 51 0.999 67.86 12.91 45.95 51.00
.25 .50 .75 .90 .95
62.00 69.00 75.00 81.00 86.00
lowest : 30 34 38 40 43, highest: 88 89 91 92 93
female : Age [years]
n missing distinct Info Mean Gmd .05 .10
338 0 58 0.999 67.01 13.74 47.00 50.70
.25 .50 .75 .90 .95
59.25 68.00 76.00 83.00 85.00
lowest : 26 28 29 33 34, highest: 87 88 89 90 91
Empty data.table (0 rows and 1 cols): gender
# Compute mean and median age by gender
d[, .(Mean=mean(age), Median=median(age)), by=gender] gender Mean Median
1: male 67.85909 69
2: female 67.00888 68
# To create a new subset
w <- d[gender == 'female' & age < 70, ]With data.table you can create a new data table with added variables, or you can add or redefine variables in a data table in place. The latter has major speed and memory efficiency implications when processing massive data tables. Here d refers to a different data table from the one used above.
# Rename a variable
setnames(d, c('gender', 'height'), c('sex', 'ht'))
# Easier:
setnames(d, .q(gender, height), .q(sex, ht))
# Add two new variables, storing result in a new data table
z <- d[, .(bmi=wt / ht ^ 2, bsa=0.016667 * sqrt(wt * ht))]
# Add one new variable in place
d[, bmi := wt / ht ^ 2]
# Add two new variables in place
d[, `:=`(bmi = wt / ht ^ 2, bas=0.016667 * sqrt(wt * ht))]
d[, .q(bmi, bsa) := .(wt / ht ^ 2, 0.016667 * sqrt(wt * ht))]
# Add label & optional units (better to use upData which works on data tables)
adlab <- function(x, lab, un='') {
label(x) <- lab
if(un != '') units(x) <- un
x
}
d[, maxhr := adlab(maxhr, 'Maximum Heart Rate', '/m')]
# Delete a variable (or use upData)
d[, bsa := NULL]
# Delete two variables
d[, `:=`(bsa=NULL, bmi=NULL)]
d[, .q(bsa, bmi) := NULL]
Many applications can use the automatically created data.table object .SD which stands for the data table for the current group being processed. If .SDcols were not specified, all variables would be attempted to be analyzed. Specify a vector of variable names as .SDcols to restrict the analysis. If there were no by variable(s), .SD stands for the whole data table.
# Compute the number of distinct values for all variables
nd <- function(x) length(unique(x))
d[, sapply(.SD, nd)] bhr basebp basedp pkhr sbp dp dose maxhr
69 94 441 105 142 508 7 103
pctMphr mbp dpmaxdo dobdose age gender baseEF dobEF
78 132 484 8 62 2 54 60
chestpain restwma posSE newMI newPTCA newCABG death hxofHT
2 2 2 2 2 2 2 2
hxofDM hxofCig hxofMI hxofPTCA hxofCABG any.event ecg
2 3 2 2 2 2 3
# Same but only for variables whose names contain hx and either D or M
d[, sapply(.SD, nd), .SDcols=patterns('hx', 'D|M')]hxofDM hxofMI
2 2
# Compute means on all numeric variables
mn <- function(x) mean(x, na.rm=TRUE)
d[, lapply(.SD, mn), .SDcols=is.numeric] bhr basebp basedp pkhr sbp dp dose maxhr
1: 75.29032 135.3244 10181.31 120.5502 146.9158 17633.84 33.75448 119.3692
pctMphr mbp dpmaxdo dobdose age baseEF dobEF chestpain
1: 78.56989 156 18549.88 30.24194 67.34409 55.60394 65.24194 0.3082437
restwma posSE newMI newPTCA newCABG death hxofHT
1: 0.4605735 0.2437276 0.05017921 0.0483871 0.05913978 0.04301075 0.7043011
hxofDM hxofPTCA hxofCABG any.event
1: 0.3691756 0.0734767 0.1577061 0.1594982
# Compute means on all numeric non-binary variables
nnb <- function(x) is.numeric(x) && length(unique(x)) > 2
d[, lapply(.SD, mn), .SDcols=nnb] bhr basebp basedp pkhr sbp dp dose maxhr
1: 75.29032 135.3244 10181.31 120.5502 146.9158 17633.84 33.75448 119.3692
pctMphr mbp dpmaxdo dobdose age baseEF dobEF
1: 78.56989 156 18549.88 30.24194 67.34409 55.60394 65.24194
# Print frequency tables of all categorical variables with > 2 levels
cmult <- function(x) ! is.numeric(x) && length(unique(x)) > 2
tab <- function(x) {
z <- table(x, useNA='ifany')
paste(paste0(names(z), ': ', z), collapse=', ')
}
d[, lapply(.SD, tab), .SDcols=cmult] hxofCig
1: heavy: 122, moderate: 138, non-smoker: 298
ecg
1: normal: 311, equivocal: 176, MI: 71
Tabulate all variables having between 3 and 10 distinct values and create a side effect when data.table is running that makes the summarization function tab store all values and frequencies in a growing list Z so that kable can render a markdown table after we pad columns to the maximum length of all columns (maximum number of distinct values).
# Using <<- makes data.table have a side effect of augmenting Z and
# Align in the global environment
tab <- function(x) {
z <- table(x, useNA='ifany')
i <- length(Z)
Z[[i+1]] <<- names(z)
Z[[i+2]] <<- as.vector(z)
Align <<- c(Align, if(is.numeric(x)) 'r' else 'l', 'r')
length(z)
}
discr <- function(x) { i <- length(unique(x)); i > 2 & i < 11 }
Z <- list(); Align <- character(0)
w <- d[, lapply(.SD, tab), .SDcols=discr]
maxl <- max(w)
# Pad shorter vectors with blanks
Z <- lapply(Z, function(x) c(x, rep('', maxl - length(x))))
Z <- do.call('cbind', Z) # combine all into columns of a matrix
colnames(Z) <- rep(names(w), each=2)
colnames(Z)[seq(2, ncol(Z), by=2)] <- 'Freq'
knitr::kable(Z, align=Align)| dose | Freq | dobdose | Freq | hxofCig | Freq | ecg | Freq |
|---|---|---|---|---|---|---|---|
| 10 | 2 | 5 | 7 | heavy | 122 | normal | 311 |
| 15 | 28 | 10 | 7 | moderate | 138 | equivocal | 176 |
| 20 | 47 | 15 | 55 | non-smoker | 298 | MI | 71 |
| 25 | 56 | 20 | 73 | ||||
| 30 | 64 | 25 | 71 | ||||
| 35 | 61 | 30 | 78 | ||||
| 40 | 300 | 35 | 62 | ||||
| 40 | 205 |
A better approach is to let the kables function put together a series of separate markdown tables of different sizes. By using the “updating Z in the global environment” side effect we are able to let data.table output any type of objects of non-conformable dimensions over variables (such as frequency tabulations).
tab <- function(x) {
z <- table(x, useNA='ifany')
i <- length(Z)
w <- matrix(cbind(names(z), z), ncol=2,
dimnames=list(NULL, c(vnames[i+1], 'Freq')))
Z[[i+1]] <<- knitr::kable(w, align=c(if(is.numeric(x)) 'r' else 'l', 'r'))
length(z)
}
discr <- function(x) { i <- length(unique(x)); i > 2 & i < 11 }
Z <- list()
vnames <- names(d[, .SD, .SDcols=discr])
w <- d[, lapply(.SD, tab), .SDcols=discr]
knitr::kables(Z)
|
|
|
|
Use a similar side-effect approach to get separate html describe output by gender.
g <- function(x, by) {
Z[[length(Z) + 1]] <<- describe(x, descript=paste('age for', by))
by
}
Z <- list()
by <- d[, g(age, gender), by=gender]
# Make Z look like describe() output for multiple variables
class(Z) <- 'describe'
attr(Z, 'dimensions') <- c(nrow(d), nrow(by))
attr(Z, 'descript') <- 'Age by Gender'
html(Z)| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 220 | 0 | 51 | 0.999 | 67.86 | 12.91 | 45.95 | 51.00 | 62.00 | 69.00 | 75.00 | 81.00 | 86.00 |
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 338 | 0 | 58 | 0.999 | 67.01 | 13.74 | 47.00 | 50.70 | 59.25 | 68.00 | 76.00 | 83.00 | 85.00 |
# Compute a 1-valued statistic on multiple variables, by cross-classification
# of two variables. Do this on a subset. .SDcols=a:b uses variables in order
# Use keyby instead of by to order output the usual way
d[age < 70, lapply(.SD, mean), keyby=.(gender, newMI), .SDcols=pkhr:dp] gender newMI pkhr sbp dp
1: male 0 122.0561 147.7664 17836.24
2: male 1 115.6667 140.6667 16437.67
3: female 0 122.8492 150.5084 18509.03
4: female 1 123.5714 171.5714 21506.71
# Compute multiple statistics on one variable
# Note: .N is a special variable: count of obs for current group
d[, .(Max=max(bhr), Min=min(bhr), Mean=mean(bhr), N=.N), by=.(gender, newMI)] gender newMI Max Min Mean N
1: male 0 127 42 74.15385 208
2: male 1 89 59 71.25000 12
3: female 0 210 45 75.96894 322
4: female 1 108 65 79.43750 16
# Same result another way
g <- function(x) list(Max=max(x), Min=min(x), Mean=mean(x), N=length(x))
d[, g(bhr), by=.(gender, newMI)] # if g returned a vector instead, use as.list(g(bhr)) gender newMI Max Min Mean N
1: male 0 127 42 74.15385 208
2: male 1 89 59 71.25000 12
3: female 0 210 45 75.96894 322
4: female 1 108 65 79.43750 16
d[, as.list(quantile(bhr)), by=gender] gender 0% 25% 50% 75% 100%
1: male 42 63 72 83 127
2: female 45 65 75 85 210
# Compute multiple statistics on multiple variables
d[, lapply(.SD, quantile), by=gender, .SDcols=.q(bhr, pkhr, sbp)] gender bhr pkhr sbp
1: male 42 52.00 80.00
2: male 63 106.75 120.00
3: male 72 123.00 140.00
4: male 83 136.00 166.00
5: male 127 176.00 250.00
6: female 45 61.00 40.00
7: female 65 106.25 122.25
8: female 75 122.00 141.50
9: female 85 134.00 170.00
10: female 210 210.00 309.00
# Similar but put percentile number in front of statistic value
# Do only quartiles
g <- function(x) {
z <- quantile(x, (1:3)/4, na.rm=TRUE)
paste(format(names(z)), format(round(z)))
}
d[, lapply(.SD, g), by=gender, .SDcols=.q(bhr, pkhr, sbp)] gender bhr pkhr sbp
1: male 25% 63 25% 107 25% 120
2: male 50% 72 50% 123 50% 140
3: male 75% 83 75% 136 75% 166
4: female 25% 65 25% 106 25% 122
5: female 50% 75 50% 122 50% 142
6: female 75% 85 75% 134 75% 170
# To have more control over labeling and to have one row per sex:
g <- function(x) {
s <- sapply(x, quantile, na.rm=TRUE) # compute quantiles for all variables -> matrix
h <- as.list(s) # vectorizes first
# Cross row names (percentile names) with column (variable) names
# paste(b, a) puts variable name in front of percentile
names(h) <- outer(rownames(s), colnames(s), function(a, b) paste(b, a))
h
}
d[, g(.SD), by=gender, .SDcols=.q(bhr, pkhr, sbp)] gender bhr 0% bhr 25% bhr 50% bhr 75% bhr 100% pkhr 0% pkhr 25% pkhr 50%
1: male 42 63 72 83 127 52 106.75 123
2: female 45 65 75 85 210 61 106.25 122
pkhr 75% pkhr 100% sbp 0% sbp 25% sbp 50% sbp 75% sbp 100%
1: 136 176 80 120.00 140.0 166 250
2: 134 210 40 122.25 141.5 170 309
# Restrict to variables bhr - basedp in order columns created in data table
d[, g(.SD), by=gender, .SDcols=bhr : basedp] gender bhr 0% bhr 25% bhr 50% bhr 75% bhr 100% basebp 0% basebp 25%
1: male 42 63 72 83 127 85 120.00
2: female 45 65 75 85 210 90 122.25
basebp 50% basebp 75% basebp 100% basedp 0% basedp 25% basedp 50% basedp 75%
1: 130 145 203 5220 7976.25 9483 11183.50
2: 136 150 201 5000 8562.00 10063 11891.25
basedp 100%
1: 16704
2: 27300
# Can put ! in front of a sequence of variables to do the opposite
# To add duplicated means to raw data use e.g.
# d[, Mean := mean(x), by=sex]Consider a baseline dataset b and a longitudinal dataset L, with subject ID of id.
b <- data.table(id=1:4, age=c(21, 28, 32, 23), key='id')
L <- data.table(id = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5),
day = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3),
y = 1 : 12, key='id')
b id age
1: 1 21
2: 2 28
3: 3 32
4: 4 23
L id day y
1: 2 1 1
2: 2 2 2
3: 2 3 3
4: 3 1 4
5: 3 2 5
6: 3 3 6
7: 3 4 7
8: 4 1 8
9: 4 2 9
10: 5 1 10
11: 5 2 11
12: 5 3 12
# Merge b and L to look up baseline age and associate it with all follow-ups
b[L, on=.(id)] # Keeps all ids in L (left inner join) id age day y
1: 2 28 1 1
2: 2 28 2 2
3: 2 28 3 3
4: 3 32 1 4
5: 3 32 2 5
6: 3 32 3 6
7: 3 32 4 7
8: 4 23 1 8
9: 4 23 2 9
10: 5 NA 1 10
11: 5 NA 2 11
12: 5 NA 3 12
L[b, on=.(id)] # Keeps all ids in b (right inner join) id day y age
1: 1 NA NA 21
2: 2 1 1 28
3: 2 2 2 28
4: 2 3 3 28
5: 3 1 4 32
6: 3 2 5 32
7: 3 3 6 32
8: 3 4 7 32
9: 4 1 8 23
10: 4 2 9 23
L[b, on=.(id), nomatch=NULL] # Keeps only ids in both b and L (right outer join) id day y age
1: 2 1 1 28
2: 2 2 2 28
3: 2 3 3 28
4: 3 1 4 32
5: 3 2 5 32
6: 3 3 6 32
7: 3 4 7 32
8: 4 1 8 23
9: 4 2 9 23
uid <- unique(c(b[, id], L[, id]))
L[b[.(uid), on=.(id)]] # Keeps ids in either b or c (full outer join) id day y age
1: 1 NA NA 21
2: 2 1 1 28
3: 2 2 2 28
4: 2 3 3 28
5: 3 1 4 32
6: 3 2 5 32
7: 3 3 6 32
8: 3 4 7 32
9: 4 1 8 23
10: 4 2 9 23
11: 5 1 10 NA
12: 5 2 11 NA
13: 5 3 12 NA
merge(b, L, by='id', all=TRUE) # also full outer join; calls merge.data.table id age day y
1: 1 21 NA NA
2: 2 28 1 1
3: 2 28 2 2
4: 2 28 3 3
5: 3 32 1 4
6: 3 32 2 5
7: 3 32 3 6
8: 3 32 4 7
9: 4 23 1 8
10: 4 23 2 9
11: 5 NA 1 10
12: 5 NA 2 11
13: 5 NA 3 12
For very large data tables, giving the data tables keys will speed execution, e.g.:
setkey(d, id)
setkey(d, state, city)
Join/merge can be used for data lookups:
s <- data.table(st=.q(AL, AK, AZ, CA, OK), y=5:1)
stateAbbrevs <- data.table(state=state.abb, State=state.name)
s[stateAbbrevs, , on=.(st=state), nomatch=NULL] st y State
1: AL 5 Alabama
2: AK 4 Alaska
3: AZ 3 Arizona
4: CA 2 California
5: OK 1 Oklahoma
To reshape data from long to wide format, take the L data table above as an example.
w <- dcast(L, id ~ day, value.var='y')
w id 1 2 3 4
1: 2 1 2 3 NA
2: 3 4 5 6 7
3: 4 8 9 NA NA
4: 5 10 11 12 NA
# Now reverse the procedure
m <- melt(w, id.vars='id', variable.name='day', value.name='y')
m id day y
1: 2 1 1
2: 3 1 4
3: 4 1 8
4: 5 1 10
5: 2 2 2
6: 3 2 5
7: 4 2 9
8: 5 2 11
9: 2 3 3
10: 3 3 6
11: 4 3 NA
12: 5 3 12
13: 2 4 NA
14: 3 4 7
15: 4 4 NA
16: 5 4 NA
setkey(m, id, day) # sorts
m[! is.na(y)] id day y
1: 2 1 1
2: 2 2 2
3: 2 3 3
4: 3 1 4
5: 3 2 5
6: 3 3 6
7: 3 4 7
8: 4 1 8
9: 4 2 9
10: 5 1 10
11: 5 2 11
12: 5 3 12
For analysis the sky is the limit. I take advantage of special formatting for model fit objects from the rms package by using html or latex methods and putting results='asis' in the chunk header to preserve the special formatting. Here is an example.
require(rms)
options(prType='html') # needed to use special formatting (can use prType='latex')
set.seed(1)
n <- 1000
w <- data.table(x=runif(n), x2=rnorm(n), y=sample(0:1, n, replace=TRUE))
dd <- datadist(w); options(datadist='dd') # rms needs for summaries and plotting
f <- lrm(y ~ x + rcs(x2, 4), data=w)
fLogistic Regression Model
lrm(formula = y ~ x + rcs(x2, 4), data = w)
|
Model Likelihood Ratio Test |
Discrimination Indexes |
Rank Discrim. Indexes |
|
|---|---|---|---|
| Obs 1000 | LR χ2 1.91 | R2 0.003 | C 0.526 |
| 0 492 | d.f. 4 | R24,1000 0.000 | Dxy 0.051 |
| 1 508 | Pr(>χ2) 0.7516 | R24,749.8 0.000 | γ 0.051 |
| max |∂log L/∂β| 2×10-14 | Brier 0.249 | τa 0.026 |
| β | S.E. | Wald Z | Pr(>|Z|) | |
|---|---|---|---|---|
| Intercept | 0.1079 | 0.2976 | 0.36 | 0.7169 |
| x | 0.1506 | 0.2197 | 0.69 | 0.4931 |
| x2 | 0.0502 | 0.2142 | 0.23 | 0.8147 |
| x2’ | -0.3705 | 0.6631 | -0.56 | 0.5763 |
| x2’’ | 1.3624 | 2.6219 | 0.52 | 0.6033 |
anova(f)
Wald Statistics for y
|
|||
| χ2 | d.f. | P | |
|---|---|---|---|
| x | 0.47 | 1 | 0.4931 |
| x2 | 1.44 | 3 | 0.6961 |
| Nonlinear | 0.33 | 2 | 0.8499 |
| TOTAL | 1.91 | 4 | 0.7523 |
The workhorse behind Rmarkdown and Quarto is knitr, which processes the code chunks and properly mingles code and tabular and graphical output. knitr has a built-in caching mechanism to make it so that code is not needlessly executed when the code inputs have not changed. This easy-to-use process does have two disadvantages: the dependencies are not transparent, and the stored cache files may be quite large. I like to take control of caching. To that end, the runifChanged function was written. Here is an example of its use. First a function with no arguments must be composed. This is the (usually slow) function that will be conditionally run if any of a group of listed objects has changed since the last time it was run. This function when needed to be run produces an object that is stored in binary form in a user-specified file (the default file name is the name of the current R code chunk with .rds appended).
# Read the source code for the hashCheck and runifChanged functions from
# https://github.com/harrelfe/rscripts/blob/master/hashCheck.r
getRs('hashCheck.r', put='source')
g <- function() {
# Fit a logistic regression model and bootstrap it 500 times, saving
# the matrix of bootstrapped coefficients
f <- lrm(y ~ x1 + x2, x=TRUE, y=TRUE, data=dat)
bootcov(f, B=500)
}
set.seed(3)
n <- 2000
dat <- data.table(x1=runif(n), x2=runif(n),
y=sample(0:1, n, replace=TRUE))
# runifChanged will write runifch.rds if needed (chunk name.rds)
# Will run if dat or source code for lrm or bootcov change
b <- runifChanged(g, dat, lrm, bootcov)
dim(b$boot.Coef)[1] 500 3
head(b$boot.Coef) Intercept x1 x2
[1,] 0.02007292 -0.30079958 0.32416398
[2,] 0.06150624 -0.35741054 0.25522669
[3,] 0.25225861 -0.40094541 0.09290729
[4,] 0.13766665 -0.48661991 0.19684403
[5,] -0.22018456 0.02132711 0.33973578
[6,] 0.18217417 -0.36140896 -0.04873320
The runParallel function makes it easy to use available multiprocessor cores to speed up parallel computations especially for simulations. By default it runs the number of available cores, less one. runParallel makes the parallel package easier to use and does recombinations over per-core batches. The user writes a function that does the work on one core, and the same function is run on all cores. This function has set arguments and must return a named list. A base random number seed is given, and the seed is set to this plus i for core number i. The total number of repetitions is given, and this most balanced possible number of repetitions is run on each core to sum to the total desired number of iterations. runifChanged is again used, to avoid running the simulations if no inputs have changed.
# Load runParallel function from github
getRs('runParallel.r', put='source')
# Function to do simulations on one core
run1 <- function(reps, showprogress, core) {
cof <- matrix(NA, nrow=reps, ncol=3,
dimnames=list(NULL, c('a', 'b1', 'b2')))
for(i in 1 : reps) {
y <- sample(0:1, n, replace=TRUE)
f <- lrm(y ~ X)
cof[i, ] <- coef(f)
}
list(coef=cof)
}
# Debug one core run, with only 3 iterations
n <- 300
seed <- 3
set.seed(seed)
X <- cbind(x1=runif(n), x2=runif(n)) # condition on covariates
run1(3)$coef
a b1 b2
[1,] -0.5455330 0.9572896 0.2215974
[2,] -0.2459193 0.3081405 0.1284809
[3,] -0.1391344 -0.2668562 0.1792319
nsim <- 5000
g <- function() runParallel(run1, reps=nsim, seed=seed)
Coefs <- runifChanged(g, X, run1, nsim, seed)
dim(Coefs)[1] 5000 3
apply(Coefs, 2, mean) a b1 b2
0.0020121803 -0.0007277216 -0.0003258610
The following output is created by the command markupSpecs$html$session(), where markupSpecs is defined in the Hmisc package.
R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Pop!_OS 21.10 Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] rms_6.3-1 SparseM_1.81 data.table_1.13.6 Hmisc_4.7-0 [5] ggplot2_3.3.3 Formula_1.2-4 survival_3.2-13 lattice_0.20-45To cite R in publications use:
R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
To cite the Hmisc package in publications use:Harrell Jr F (2022). Hmisc: Harrell Miscellaneous. R package version 4.7-0, https://hbiostat.org/R/Hmisc/.
To cite the rms package in publications use:Harrell Jr FE (2022). rms: Regression Modeling Strategies. https://hbiostat.org/R/rms/, https://github.com/harrelfe/rms.
To cite the data.table package in publications use:Dowle M, Srinivasan A (2020). data.table: Extension of 'data.frame'. R package version 1.13.6, https://CRAN.R-project.org/package=data.table.
To cite the ggplot2 package in publications use:Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
To cite the survival package in publications use:Therneau T (2021). A Package for Survival Analysis in R. R package version 3.2-13, https://CRAN.R-project.org/package=survival.